{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#分类变量处理"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "分类变量是经常遇到的问题。一方面它们提供了信息；另一方面，它们可能是文本形式——纯文字或者与文字相关的整数——就像表格的索引一样。\n",
    "\n",
    "因此，我们在建模的时候往往需要将这些变量量化，但是仅仅用简单的`id`或者原来的形式是不行的。因为我们也需要避免在上一节里*通过阈值创建二元特征*遇到的问题。如果我们把数据看成是连续的，那么也必须解释成连续的。\n",
    "\n",
    "<!-- TEASER_END -->"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##Getting ready"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里`boston`数据集不适合演示。虽然它适合演示二元特征，但是用来创建分类变量不太合适。因此，这里用`iris`数据集演示。\n",
    "\n",
    "解决问题之前先把问题描述清楚。假设有一个问题，其目标是预测花萼的宽度；那么花的种类就可能是一个有用的特征。\n",
    "\n",
    "首先，让我们导入数据："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn import datasets\n",
    "iris = datasets.load_iris()\n",
    "X = iris.data\n",
    "y = iris.target"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "现在`X`和`y`都获得了对应的值，我们把它们放到一起："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "d = np.column_stack((X, y))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##How to do it..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下面我们把花类型`y`对应那一列转换成分类特征："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 1.,  0.,  0.],\n",
       "       [ 1.,  0.,  0.],\n",
       "       [ 1.,  0.,  0.],\n",
       "       [ 1.,  0.,  0.],\n",
       "       [ 1.,  0.,  0.]])"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn import preprocessing\n",
    "text_encoder = preprocessing.OneHotEncoder()\n",
    "text_encoder.fit_transform(d[:, -1:]).toarray()[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##How it works..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里，编码器为每个分类变量创建了额外的特征，转变成一个稀疏矩阵。矩阵是这样定义的：每一行由0和1构成，对应的分类特征是1，其他都是0。用稀疏矩阵存储数据很合理。\n",
    "\n",
    "`text_encoder`是一个标准的scikit-learn模型，可以重复使用："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 0.,  1.,  0.],\n",
       "       [ 0.,  1.,  0.],\n",
       "       [ 0.,  1.,  0.]])"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text_encoder.transform(np.ones((3, 1))).toarray()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##There's more..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在scikit-learn和Python库中，还有一些方法可以创建分类变量。如果你更倾向于用scikit-learn，而且分类编码原则很简单，可以试试`DictVectorizer`。如果你需要处理更复杂的分类编码原则，`patsy`是很好的选择。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###DictVectorizer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`DictVectorizer`可以将字符串转换成分类特征："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 1.,  0.,  0.],\n",
       "       [ 1.,  0.,  0.],\n",
       "       [ 1.,  0.,  0.],\n",
       "       [ 1.,  0.,  0.],\n",
       "       [ 1.,  0.,  0.]])"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.feature_extraction import DictVectorizer\n",
    "dv = DictVectorizer()\n",
    "my_dict = [{'species': iris.target_names[i]} for i in y]\n",
    "dv.fit_transform(my_dict).toarray()[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ">Python的词典可以看成是一个稀疏矩阵，它们只包含非0值。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###Pasty"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`patsy`是另一个分类变量编码的包。经常和`StatsModels`一起用，`patsy`可以把一组字符串转换成一个矩阵。\n",
    "\n",
    ">这部分内容与 scikit-learn关系不大，跳过去也没关系。\n",
    "\n",
    "例如，如果`x`和`y`都是字符串，`dm = patsy.design_matrix(\"x + y\")`将创建适当的列。如果不是，`C(x)`将生成一个分类变量。\n",
    "\n",
    "例如，初看`iris.target`，可以把它当做是一个连续变量。因此，用下面的命令处理："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DesignMatrix with shape (150, 3)\n",
       "  C(species)[0]  C(species)[1]  C(species)[2]\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "              1              0              0\n",
       "  [120 rows omitted]\n",
       "  Terms:\n",
       "    'C(species)' (columns 0:3)\n",
       "  (to view full data, use np.asarray(this_obj))"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import patsy\n",
    "patsy.dmatrix(\"0 + C(species)\", {'species': iris.target})"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
